Linear Models

Barcelona Data Institute — Aleix Ruiz de Villa

Introduction

Who am I?

Aleix Ruiz de Villa

  • PhD in mathematics
  • Head of data science at LaVanguardia.com, SCRM (Lidl), and Onna
  • Cofounder of Barcelona R Users Group (2011) and Barcelona Data Science and Machine Learning Meetup (2014)
  • Currently: sabbatical year, deepening my knowledge of Causal Inference

Why linear models?

About this course

Part I: Prediction

Estimate the value of sales at unobserved TV investment quantities.

Part II: Causal Inference

How much does TV investment affect sales?

Linear Models

Reference Book

Introduction to Statistical Learning

Warm up exercises

  1. Download and read (using read.csv) the Advertising.csv data set (from webpage > Data Sets)
  2. Consider the line \(Sales = 15 + TV \times 0.01\). Calculate (program) its value at \(TV=44.5\) and at \(TV=0\)
  3. Plot \(TV\) vs \(sales\) (using the plot function). Add the previously calculated point (using the points function)
  4. Calculate the error between the previous prediction and the real value
  5. Add all the points on the line \(Sales = 15 + TV \times 0.01\) to the previous plot
  6. Calculate the errors at all \(TV\) points in the data set with respect to the line \(Sales = 15 + TV \times 0.01\)
  7. What is the sum of those errors?
  8. How much do Sales increase when we spend one more unit of TV advertisement along this line?
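A minimal sketch of exercises 2–4, using the value \(TV=44.5\), whose observed sales in the data set are \(10.4\) (row 2 shown later in these slides); the commented lines assume the Advertising.csv file has been downloaded:

```r
# The line Sales = 15 + TV * 0.01 as a function
line_sales <- function(tv) 15 + tv * 0.01

line_sales(44.5)  # 15.445
line_sales(0)     # 15

# Error between the line's prediction and the observed value at TV = 44.5
10.4 - line_sales(44.5)  # -5.045

# For the full data set (once ads <- read.csv("Advertising.csv") is run):
# plot(ads$TV, ads$sales)
# points(ads$TV, line_sales(ads$TV), col = "red")
# errors <- ads$sales - line_sales(ads$TV); sum(errors)
```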

Best line in R

linear_model <- lm(sales ~ TV, ads)
summary(linear_model)
## 
## Call:
## lm(formula = sales ~ TV, data = ads)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

Some questions

  • What are the coefficients? What is the interpretation of each one?
  • What do you think Residuals may mean?
  • What is the Std. Error? How can we increase/decrease it?

Prediction

What is the predicted value at \(TV=44.5\)?

df <- data.frame(TV=44.5)
predict(linear_model, df)

Exercise: Calculate the Mean Squared Error
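One way to compute the Mean Squared Error, sketched here on simulated data so it is self-contained; the same last line applies verbatim to the linear_model fitted above:

```r
# Simulated data standing in for the ads data frame
set.seed(1)
x <- rnorm(100)
y <- 7 + 0.05 * x + rnorm(100)
fit <- lm(y ~ x)

# MSE = mean of the squared residuals
mse <- mean(residuals(fit)^2)
mse
```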

A little bit of theory…

Given a data set \(\{x_i, y_i\}\) we want to find the line that minimizes the sum of errors

\[S(a, b) = (y_1 - (a + b x_1))^2 + ... + (y_n - (a + b x_n))^2 = \sum_i (y_i - a - b x_i)^2\] Question: In the formula above, what are the unknowns?

Least Squares problem:

\[\min_{a, b} \sum_i (y_i - a - b x_i)^2 \]

Least Squares formulas:

Consider \(\bar x = \mbox{mean}(x)\) and \(\bar y = \mbox{mean}(y)\):

Slope

\[\hat b = \frac{Cov(x, y)}{Var(x)} = \frac{\sum_i (x_i - \bar x)(y_i - \bar y) }{\sum_i (x_i-\bar x)^2} \]

Intercept

\[\hat{a} = \bar y - \hat b \bar x\]
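The two formulas above can be checked numerically: on simulated data, the hand-computed \(\hat b\) and \(\hat a\) match the coefficients returned by lm (cov and var both use the \(n-1\) denominator, which cancels in the ratio):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

b_hat <- cov(x, y) / var(x)          # slope
a_hat <- mean(y) - b_hat * mean(x)   # intercept

c(a_hat, b_hat)
coef(lm(y ~ x))  # same values
```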

Exercise:

  • At the point \(x^* = \bar x\), what is the prediction made by the linear regression model?
  • Solution: \(y(\bar x) = \hat a + \hat b \bar x = \bar y - \hat b \bar x + \hat b \bar x = \bar y\)

Correlation and linear regression

\[r_{xy} = \hat b \frac{\sqrt{Var(x)}}{\sqrt{Var(y)}} = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_i (x_i - \bar x)^2 }\sqrt{\sum_i (y_i - \bar y)^2 }}\] where \(-1\leq r_{xy} \leq 1\).

  • If \(x, y\) are independent \(\implies r_{xy} = 0\)
  • If \(|r_{xy}| = 1 \implies y\) is an exact linear function of \(x\).
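The identity \(r_{xy} = \hat b \sqrt{Var(x)}/\sqrt{Var(y)}\) can be verified numerically on simulated data:

```r
set.seed(2)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)

b_hat <- cov(x, y) / var(x)   # regression slope

cor(x, y)
b_hat * sd(x) / sd(y)         # same value
```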

Checkpoint !!!

Summary:

  • When are linear models useful?
  • How do we derive them?
  • How does sample size affect the estimates?
  • How do we interpret the coefficients?

Questions:

  • Which extensions of linear models can you think of?
  • Which applications of linear models can you think of?

Multilinear regression

When does this happen?

What if we want to take into account more information?

  • radio advertisement?
  • newspapers advertisement?

Question: What advantages do you expect from adding more variables?

Multilinear model

\[\mbox{sales} = \alpha_0 + \alpha_1 \times \mbox{TV} + \alpha_2 \times \mbox{radio} + \alpha_3 \times \mbox{newspapers}\]

Questions

  • How can we plot it?
  • How do we interpret each of the coefficients?

Plot

Data Structure

Covariates X (matrix)

ads[, 2:4] %>% head
##      TV radio newspaper
## 1 230.1  37.8      69.2
## 2  44.5  39.3      45.1
## 3  17.2  45.9      69.3
## 4 151.5  41.3      58.5
## 5 180.8  10.8      58.4
## 6   8.7  48.9      75.0

Response Y

head(ads[, 'sales', drop=FALSE])
##   sales
## 1  22.1
## 2  10.4
## 3   9.3
## 4  18.5
## 5  12.9
## 6   7.2

Matrix formulation

\[\left[ \begin{array}{cccc} intercept & TV & radio & newspapers \\ 1 & 230.1 & 37.8 & 69.2 \\ 1 & 44.5 & 39.3 & 45.1 \\ ... & ... & ... & ... \end{array} \right] \left[ \begin{array}{c} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{array}\right] = \left[ \begin{array}{c} sales \\ 22.1 \\ 10.4 \\ ... \end{array}\right]\]

or equivalently

\[X \alpha = Y\]

Least Squares problem:

\[\min_{\alpha_0, \alpha_1, \alpha_2, \alpha_3} \sum_i (y_i - (\alpha_0 + \alpha_1 \times \mbox{TV}_i + \alpha_2 \times \mbox{radio}_i + \alpha_3 \times \mbox{ newspapers}_i))^2 \]

Matrix formula

\[\hat \alpha = (X^tX)^{-1}X^t Y\]
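The matrix formula translated directly into R; simulated data is used here so the snippet stands alone, but the same two lines apply to the ads matrices:

```r
set.seed(3)
n <- 100
X <- cbind(1, matrix(rnorm(3 * n), ncol = 3))  # intercept column + 3 covariates
Y <- drop(X %*% c(1, 2, 3, 4)) + rnorm(n)

# alpha_hat = (X^t X)^{-1} X^t Y
alpha_hat <- solve(t(X) %*% X) %*% t(X) %*% Y

drop(alpha_hat)
coef(lm(Y ~ X[, 2:4]))  # same values
```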

Exercise

  1. Calculate the linear model with lm function
  2. (Advanced) Calculate the coefficients with the formula above. You will need functions as.matrix, solve, t and matrix multiplication %*%.
  3. Interpret each coefficient

Checkpoint!!

Summary:

  • We can include more variables to improve prediction accuracy
  • Multilinear models solve the same problems as linear models: Least Squares
  • In many programming libraries data is passed in matrices \(X\) and \(Y\)

Questions:

  • Which extensions of multilinear models can you think of?
  • Which problems do you think we will find?

Working with multilinear models

Coefficient effect

  • What if TV were measured in a different unit (currency, scale, …)?
  • \(\implies\) coefficients with different measurement units cannot be compared

Variable normalization

  • Step 1: \(x' = x - \bar x\)
  • Step 2: \(x' = (x - \bar x)/\sigma(x)\)
  • Result: \(\bar{x'} = 0\) and \(\sigma(x') = 1\)
  • Exercise: Normalize variable \(TV\)
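The two normalization steps in R, shown on a simulated variable (the same lines work for ads$TV):

```r
set.seed(4)
x <- rnorm(100, mean = 10, sd = 2)

x_centered <- x - mean(x)        # step 1: center
x_norm <- x_centered / sd(x)     # step 2: rescale

c(mean(x_norm), sd(x_norm))      # approximately 0 and exactly 1
# scale(x) performs both steps at once
```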

Transformations of variables

Even if the function is not linear in \(x\), it is linear in the coefficients! \[ y = 2x^2 + \varepsilon\]

Transformations of variables

x = seq(0, 20, 0.2)
y = 2*x^2 + rnorm(length(x), 0, 5)
x2 = x^2

summary(lm(y~x2))
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2072  -3.1011   0.7911   3.2493  10.7577 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.5834     0.7930  -0.736    0.464    
## x2            2.0006     0.0044 454.697   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.329 on 99 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9995 
## F-statistic: 2.067e+05 on 1 and 99 DF,  p-value: < 2.2e-16

The same works for any transformation of the variables: \(\log\), handcrafted transformations, products of variables, ….

Putting redundancy in your model

Exercise:

  1. Build a new variable TV2 exactly equal to TV and fit a linear model including both. What happens? Why?
  2. Build a new variable TV2 which is TV plus some noise (use rnorm with sd levels \(0.5, 2, 5, 20\)) and fit a linear model including both. What do you observe? Can you explain it?

Multicolinearity

  • When some covariate can be modelled linearly with respect to the other covariates \(\implies\) inflated confidence intervals or possible numerical errors.
  • Typically some variables are removed. But which ones?

Categorical variables

  • Imagine you have a socioeconomic variable with values High, Medium, Low. How would you include this variable in the model?
  • Can you use a variable with values \(1,2,3\)?

Basic solution

\[\left[ \begin{array}{c} \mbox{Variable} \\ \mbox{High} \\ \mbox{Medium} \\ \mbox{Low} \end{array}\right] \implies \left[ \begin{array}{ccc} High & Medium & Low \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array} \right] \]

Question: Which problems might this approach face?
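R's own encoding can be inspected with model.matrix; note that it drops one level (the reference) rather than using the full one-hot table above, whose columns sum to 1 and are therefore collinear with the intercept:

```r
ses <- factor(c("High", "Medium", "Low", "High"))

model.matrix(~ ses)       # reference level (High) dropped
model.matrix(~ ses - 1)   # full one-hot: rows sum to 1, collinear with an intercept
```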

Categorical variables

x = rep(c('a', 'b'), 3)
y = rep(c(1, 2), 3)
print(x)
## [1] "a" "b" "a" "b" "a" "b"
print(y)
## [1] 1 2 1 2 1 2
lm(y~x) 
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)           xb  
##           1            1
  • Use the factor type for categorical variables

Checkpoint !!!

Summary:

  • Coefficients may not be comparable if using different measurement units \(\implies\) variable normalization
  • Use variable transformations
  • Beware of multicolinearity (producing large confidence intervals)
  • How to treat categorical variables

Overfitting

Simulation

  • We usually get data and we analyze it.
  • Since we don’t know how the data was generated, sometimes we may reach inconclusive answers
  • Alternative: we create our data and then we analyze it
  • In this way we can answer methodological questions

Exercise - Building a linear model

  1. Use set.seed
  2. Generate a sample of length \(n = 10\) from variable \(x_1\) using rnorm with standard deviation \(\delta_x = 2\) and the matrix function.
  3. Generate coefficients \(a\) and \(b\) (randomly or not)
  4. Generate variable \(y = a + b x_1 + \varepsilon\) with \(\varepsilon\) generated with rnorm with standard deviation \(\delta_y = 0.1\)
  5. Fit the model with lm
  6. See how the confidence intervals and the residual standard error change when you try different \(\delta_y \in \{0.2, 0.3, 0.4\}\) and sample size \(n=20\).
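Steps 1–5 can be sketched as follows; the coefficient values \(a=1, b=2\) and the seed are illustrative choices, not part of the exercise statement:

```r
set.seed(42)
n <- 10

x1 <- rnorm(n, sd = 2)               # covariate with delta_x = 2
a <- 1; b <- 2                       # chosen coefficients
y <- a + b * x1 + rnorm(n, sd = 0.1) # noise with delta_y = 0.1

fit <- lm(y ~ x1)
summary(fit)
confint(fit)  # confidence intervals for a and b
```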

Exercise - Adding variables

  1. For sample size \(n = 10\) generate \(4\) variables \(x_2, ..., x_5\) randomly. What do you expect will happen when adding them to the model?
  2. Fit the model with these extra variables
  3. For sample size \(n = 10\) generate \(8\) variables \(x_2, ..., x_9\) randomly. What do you expect will happen when adding them to the model?
  4. Fit the model with these extra variables

Exercise - Testing predictions

  1. Generate new data sets with \(4\) and \(10\) variables.
  2. Use predict on both with the previously trained models and check the prediction accuracy

Complexity adjustment

Model complexity trade-off

Cross validation

Checkpoint!!

Summary:

For each data set we have to adjust the complexity of the model

We can manage overfitting with cross validation
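A minimal k-fold cross-validation sketch for a linear model, on simulated data; the fold count, formula, and error metric are illustrative choices:

```r
set.seed(5)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

k <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold assignment

# Fit on k-1 folds, measure MSE on the held-out fold
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = d[folds != i, ])
  preds <- predict(fit, d[folds == i, ])
  mean((d$y[folds == i] - preds)^2)
})

mean(cv_mse)  # cross-validated estimate of the prediction error
```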

Logistic Regression

When is it necessary?

  • What if our response variable is dichotomous \(0, 1\)?
  • Can you name situations where this happens?
  • What would happen if you use linear regression?

Logistic function

Logistic regression

  • Logistic function: \(p(s) = \frac{1}{1 + e^{-s}}\)
  • Logistic transformation \(p(x) = \frac{1}{1 + e^{- (\alpha_0 + \alpha_1 x_1)}}\)
  • Also useful for multivariate data
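The logistic function itself is two lines of R; it maps any real \(s\) into \((0,1)\), with \(p(0)=0.5\):

```r
logistic <- function(s) 1 / (1 + exp(-s))

logistic(0)              # 0.5
logistic(c(-Inf, Inf))   # 0 and 1
# curve(logistic, from = -6, to = 6)  # the classic S-shape
```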

Estimation

  • Estimated using maximum likelihood
  • If there is perfect separation, the numerical method fails.
  • Multicolinearity also affects coefficient estimates
  • Prediction is a probability

In R

Using the Default dataset

summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

In R

logistic_model <- glm(default~student+balance+income,family="binomial",data=Default)
summary(logistic_model)
## 
## Call:
## glm(formula = default ~ student + balance + income, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8

Prediction

predict(logistic_model, Default[1:2, ], type="response")
##           1           2 
## 0.001428724 0.001122204

Evaluating predictions

preds <- predict(logistic_model, Default, type='response')
results <- data.frame(Default$default, preds = preds)
head(results[ order(-results$preds), ])
##      Default.default     preds
## 8496             Yes 0.9776263
## 1137             Yes 0.9739873
## 7816             Yes 0.9662212
## 6076             Yes 0.9565568
## 9874              No 0.9525549
## 1161             Yes 0.9471594

AUC (Area Under the Curve): ordering by the predictions, it is the probability that, choosing a positive and a negative example at random, the positive is ranked above the negative.

  • Same result with linear and probability predictions

Evaluating predictions in R

library(pROC)
roc_obj <- roc(Default$default, preds)
auc(roc_obj)
## Area under the curve: 0.9496

Exercise

  1. (Extra) Use the shift function from the data.table package to build lagged variables
  2. Use the Smarket dataset from ISLR to build a logistic model for predicting Direction in stock markets.
  3. Separate randomly train and test (80% and 20%)
  4. Fit logistic model in train and evaluate in test using auc

Homework

Exercise

We will work on the Carseats dataset from the ISLR package (carseat sales). Using linear models, perform a 5-fold cross-validation (80% train, 20% test) with different sets of variables. What is the best prediction accuracy that you can get?